Abstract

We address an issue in the transition from genetic linkage analysis to genetic association analysis: how to
correctly account for correlations between samples obtained from a pedigree in a case-control analysis. Since
correlation does not affect the mean of genotype or allele frequency estimates (it only affects the variance), we
introduce the concept of "effective sample size" to account for this effect. The concept of effective sample size
greatly simplifies the handling of complicated relationships between correlated samples. For example, for allele
frequency estimation, sibpairs and parent-child pairs are equivalent to 1.5 samples, first cousins and uncle-nephew
pairs are equivalent to 1.6 samples, etc., without considering the affection status. For genotype frequency estimation,
the effective sample size concept is perhaps less convenient because its value depends on the particular genotype and
on the allele frequency. We present the formulas for the test statistic and the 95% confidence interval of the odds
ratio, the two most frequently used quantities in case-control analysis, for correlated samples using the effective
sample size.
1 Introduction



The genetic case-control association analysis [7] is one of the most commonly used approaches
in human disease gene mapping. In this analysis, researchers collect tissue or blood samples
from patients as well as normal controls. DNA is extracted from the samples
for genotyping, either by microsatellite markers or by single-nucleotide-polymorphism (SNP) markers.
Genotyping across the complete human genome is now technologically feasible (e.g., using chips) and has been
routinely carried out. The goal of a genetic association analysis is to locate genetic markers that exhibit
significantly different allele or genotype frequencies between the patient (case) and the normal
(control) group.

Another competing paradigm for disease gene mapping is genetic linkage analysis [16].
In linkage analysis, families with multiple occurrences of a disease are identified, and samples
are collected for genotyping. For common "complex" human diseases with unknown disease
etiology and disease inheritance mode, one popular study design for linkage analysis is the
non-parametric affected sibpair analysis. Many research groups, ours included [9, 10, 1], started out
by collecting affected sibpairs for genetic linkage analysis, but now plan to use these patient
samples in case-control association analysis as well.

The transition from linkage to association analysis faces one statistical problem: for almost all 
statistical tests, samples are assumed to be independent. Three options are available if samples 
are correlated: first, the independence condition is forced by picking one sample per family; 
second, the independence condition is ignored while all correlated samples are included in the 
analysis; third, all correlated samples are included but their correlation is accounted for in the 
analysis. The first option loses samples and thus reduces statistical power. The second option, the
so-called naive estimator, is generally unbiased but underestimates the variance, thus reporting
incorrect p-values. We focus on the third option, using an approach called "naive estimator with
effective sample size", to provide a simple solution for handling correlated samples.

Although the topic of correlated samples in genetic case-control association analysis is not
new (see, for example, Refs. [11], [6]), our approach deviates from many other publications
by not using the likelihood method. The reason for this is that we would like to provide an
easy solution accessible to most practitioners of genetic analysis, who may or may not have a
strong mathematical background. Notice that apart from the correlation between the two relatives in a
relative pair, most samples are still independent because they are from different pedigrees. In
this context, although the likelihood method can correctly account for correlation between multiple
samples in a pedigree, the simple effective sample size approach is an excellent alternative that
is easily understandable and can be applied using existing analysis programs.

This article is organized as follows: we reproduce the important result that correlation among
samples does not bias the allele frequency estimation; then we calculate the variance of genotype
frequency when a fixed type of relative pair is used; the concept of effective sample size is
introduced; the variance calculation is extended to the allele frequency estimator; the use of effective
sample size in the test statistic calculation is provided; and the use of effective sample size in
calculating confidence intervals of the odds ratio is presented.



2 Correlation Among Samples Does Not Bias the Mean 



Before examining the effect of correlation among samples, we first discuss the quantity which is
not affected by correlation: the variable mean. Suppose all our samples consist of $N_r$ correlated
pairs: $\{x_i, y_i\}$ ($i = 1, 2, \ldots, N_r$), where $x_i$ and $y_i$ are the variable values for individuals 1 and 2 in
pair $i$, such as the genotype indicator variable. The expected value of the estimator of the mean using all samples,

$$
E\left[\frac{1}{2N_r}\sum_{i=1}^{N_r}(x_i+y_i)\right] = \frac{1}{2N_r}\sum_{i=1}^{N_r}\left(E[x_i]+E[y_i]\right) = E[x_i], \tag{1}
$$

is identical to that of the estimator using one sample per pair (note that $E[x_i] = E[y_i]$). In other words,
correlation between samples in a pair does not bias the mean estimator. The key to this proof
is that correlation between $x$ and $y$ only enters as a product, whereas for the estimation of the
mean, there is no cross-product term.

In a slightly more complicated situation, the samples consist of singletons, correlated
pairs, and correlated triples. We can write the singletons as $\{x_l\}$ ($l = 1, 2, \ldots, N_1$), pairs as
$\{x'_m, y'_m\}$ ($m = 1, 2, \ldots, N_2$), and triples as $\{x''_n, y''_n, z''_n\}$ ($n = 1, 2, \ldots, N_3$). The mean of the
naive allele/genotype counting estimator using all $N$ ($N = N_1 + 2N_2 + 3N_3$) samples is:

$$
E\left[\frac{1}{N}\left(\sum_{l=1}^{N_1} x_l + \sum_{m=1}^{N_2}(x'_m+y'_m) + \sum_{n=1}^{N_3}(x''_n+y''_n+z''_n)\right)\right] = \frac{N_1+2N_2+3N_3}{N}\,E[x] = E[x], \tag{2}
$$

again identical to the estimator using only one uncorrelated sample per group. It can easily be
seen that the same conclusion holds for other, more complicated situations.

This general conclusion has a direct consequence for the estimation of allele frequencies from
pedigree data. In most pedigree analysis programs, two options are available for estimating
allele frequencies: either from all individuals or from pedigree founders only.¹ It is acknowledged
in Ref. [5] that "one will not go far wrong in simply using the data for all individuals,
ignoring their relationship".

¹ See, e.g., the program PEDMANAGER, http://www.broad.mit.edu/ftp/distribution/software/pedmanager/
If both the naive estimator using correlated samples and that using independent samples lead to
the same mean, why should we worry about the naive estimator? There are two concerns.
First, if singletons and pairs are collected under different conditions, a selection bias
might be introduced. Second, although the means of the two estimators are the same, their variances
are not. For a particular dataset with a finite number of samples, an estimator with a larger
variance is more likely to miss the true value. We will focus on the variance calculation next.
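
The contrast between an unchanged mean and an inflated variance can be illustrated with a small Monte Carlo experiment. The following is only a sketch under stated assumptions: a biallelic marker in Hardy-Weinberg equilibrium, random mating, full sibpairs, and no ascertainment on affection status. The function names, the allele frequency 0.3, and the replicate counts are hypothetical choices made for illustration, not part of the original analysis.

```python
import random
import statistics


def simulate_sibpair(p, rng):
    """Return the allele-A counts (0, 1 or 2) of two full sibs of random parents."""
    # Each parent carries two alleles drawn independently (Hardy-Weinberg).
    father = [rng.random() < p, rng.random() < p]
    mother = [rng.random() < p, rng.random() < p]
    counts = []
    for _ in range(2):
        # Each sib inherits one randomly chosen allele from each parent.
        counts.append(int(rng.choice(father)) + int(rng.choice(mother)))
    return counts[0], counts[1]


def naive_allele_freq(p, n_pairs, rng):
    """Naive allele-counting estimate using both members of every sibpair."""
    total = 0
    for _ in range(n_pairs):
        c, d = simulate_sibpair(p, rng)
        total += c + d
    return total / (4 * n_pairs)  # 2 individuals x 2 alleles per pair


def summarize(p=0.3, n_pairs=100, n_rep=2000, seed=1):
    """Mean of the naive estimator and its variance relative to independent samples."""
    rng = random.Random(seed)
    estimates = [naive_allele_freq(p, n_pairs, rng) for _ in range(n_rep)]
    mean = statistics.fmean(estimates)
    var = statistics.variance(estimates)
    # Variance expected from 2 * n_pairs independent individuals.
    var_independent = p * (1 - p) / (4 * n_pairs)
    return mean, var / var_independent


if __name__ == "__main__":
    mean, ratio = summarize()
    print(f"mean estimate = {mean:.4f}, variance ratio = {ratio:.2f}")
```

In such a run the mean stays close to the true allele frequency, while the variance ratio comes out clearly above 1, which is the effect quantified analytically in the following sections.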



3 Variance of Genotype Frequency Estimated from Relative Pairs 

We discuss the following situation: genotype frequencies are to be estimated from a group of
relative pairs of the same type, e.g., 500 sibpairs. The genetic marker is assumed to have two
alleles, A and B, with allele frequencies $p$ and $q$, and three possible genotypes, AA, AB, BB,
with expected genotype frequencies of $p^2$, $2pq$, and $q^2$ under Hardy-Weinberg equilibrium. The
genotype indicator vectors [20] for the first and second relative can be written as $X_i$ and $Y_j$ ($i$ and $j$
are the genotype indices). The variance of the naive estimator of genotype frequency $G_i$ ($i = 2, 1, 0$ refer
to the AA, AB, BB genotypes) is:



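$$
\mathrm{Var}[G_i] = \mathrm{Var}\left[\frac{1}{2N_r}\sum_{l=1}^{N_r}\left(X_{i,l}+Y_{i,l}\right)\right] = \frac{\mathrm{Var}[X_i]}{2N_r} + \frac{\mathrm{Cov}[X_i,Y_i]}{2N_r}, \tag{3}
$$

where $l$ labels the relative pair. The first term in Eq. (3) is the variance of genotype frequency estimated from $2N_r$ independent
samples. The second term in Eq. (3) is the extra variance due to the correlation between the two
relatives in a relative pair:

$$
\mathrm{Cov}[X_i, Y_j] = E(X_i Y_j) - E(X_i)E(Y_j). \tag{4}
$$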
The cross-covariance matrix $\mathrm{Cov}[X_i, Y_j]$ can be calculated with the Li-Sacks ITO matrices [12, 13]:

$$
\begin{aligned}
\mathrm{Cov}[X_i, Y_j] &= E(X_i Y_j) - E(X_i)E(Y_j) = \sum_{k=0}^{2} P(Y_j|X_i,k)\,P(k)\,P(X_i) - P(X_i)P(Y_j) \\
&= \sum_{k=0}^{2} P(Y_j|X_i,k)\,\pi_k G_i - G_i G_j,
\end{aligned}
\tag{5}
$$

where $k$ ($k = 0, 1, 2$) is the identity-by-descent (IBD) status between the two relatives, $\pi_k = P(k)$
is the probability of IBD = $k$, $G_i$ or $G_j$ is the population genotype frequency, and $P(Y_j|X_i,k)$
($k = 0, 1, 2$) are the ITO matrices [12, 13, 14, 8]:

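$$
\underbrace{\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{\mathrm{IBD}=2},\qquad
\underbrace{\begin{pmatrix} p & q & 0 \\ p/2 & 1/2 & q/2 \\ 0 & p & q \end{pmatrix}}_{\mathrm{IBD}=1},\qquad
\underbrace{\begin{pmatrix} p^2 & 2pq & q^2 \\ p^2 & 2pq & q^2 \\ p^2 & 2pq & q^2 \end{pmatrix}}_{\mathrm{IBD}=0}
\tag{6}
$$

Rows index the genotype of the first relative and columns that of the second, both in the order AA, AB, BB.

relative pair              Pr(IBD=2)=π2   Pr(IBD=1)=π1   Pr(IBD=0)=π0   kinship coefficient φ
identical twins            1              0              0              1/2
sibpair                    1/4            1/2            1/4            1/4
parent-child               0              1              0              1/4
uncle-nephew, half-sib     0              1/2            1/2            1/8
unrelated                  0              0              1              0

Table 1: IBD probability for several types of relative pair.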
It can be shown that the variance of the naive genotype frequency estimator is:

$$
\mathrm{Var}[G_i] = \frac{G_i(1-G_i)}{2N_r} + \frac{G_i\sum_{k=0}^{2}\pi_k P(Y_i|X_i,k)}{2N_r} - \frac{G_i^2}{2N_r}. \tag{7}
$$

For two unrelated individuals, $\pi_2 = \pi_1 = 0$, $\pi_0 = 1$, and $P(Y_i|X_i,0) = G_i$, so the second and the third terms in Eq. (7)
cancel, and only the first term remains.
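
The covariance calculation above can be sketched numerically. The code below is an illustrative sketch, not the authors' implementation: it encodes the standard ITO conditional-probability matrices (genotype order AA, AB, BB, which corresponds to list positions 0, 1, 2 rather than the indices 2, 1, 0 used in the text) and returns the independent-sample term and the extra covariance term of the genotype-frequency variance. The function names and the dictionary layout of the IBD probabilities are assumptions made for this example.

```python
def ito_matrices(p):
    """Conditional genotype probabilities P(Y_j | X_i, k); rows/columns ordered AA, AB, BB."""
    q = 1.0 - p
    ident = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]                # IBD = 2
    trans = [[p, q, 0], [p / 2, 0.5, q / 2], [0, p, q]]      # IBD = 1
    zero = [[p * p, 2 * p * q, q * q] for _ in range(3)]     # IBD = 0
    return {2: ident, 1: trans, 0: zero}


def genotype_variance_terms(p, ibd_probs, n_pairs):
    """Return [(independent term, extra covariance term)] for AA, AB, BB.

    ibd_probs maps the IBD status k (0, 1, 2) to its probability pi_k.
    """
    q = 1.0 - p
    freqs = [p * p, 2 * p * q, q * q]
    ito = ito_matrices(p)
    terms = []
    for i, g in enumerate(freqs):
        # P(second relative has genotype i | first relative has genotype i)
        cond = sum(ibd_probs[k] * ito[k][i][i] for k in (0, 1, 2))
        cov = g * cond - g * g  # Cov[X_i, Y_i]
        terms.append((g * (1 - g) / (2 * n_pairs), cov / (2 * n_pairs)))
    return terms


if __name__ == "__main__":
    sib = {2: 0.25, 1: 0.5, 0: 0.25}
    for name, (indep, extra) in zip(("AA", "AB", "BB"),
                                    genotype_variance_terms(0.3, sib, 500)):
        print(name, indep, extra)
```

For unrelated pairs (`{2: 0, 1: 0, 0: 1}`) the extra term vanishes for every genotype, matching the cancellation described above.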

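The expected IBD probabilities between two relatives of different types are readily available [11],
and some of them are included in Table 1. Take sibpairs for example: $\pi_2 = 1/4$, $\pi_1 = 1/2$, $\pi_0 = 1/4$,
and the variances of genotype frequencies by Eq. (7) are:

$$
\begin{aligned}
\mathrm{Var}[G_2]_{\mathrm{sib}} &= \frac{p^2(1-p^2)}{2N_r} + \frac{p^2 q(1+3p)}{8N_r}, \\
\mathrm{Var}[G_1]_{\mathrm{sib}} &= \frac{2pq(1-2pq)}{2N_r} + \frac{pq(1-3pq)}{2N_r}, \\
\mathrm{Var}[G_0]_{\mathrm{sib}} &= \frac{q^2(1-q^2)}{2N_r} + \frac{p q^2(1+3q)}{8N_r}.
\end{aligned}
\tag{8}
$$

The variances of genotype frequencies in other relative pairs can be derived similarly by inserting
the IBD probabilities from Table 1 into Eq. (7). For parent-child pairs and uncle-nephew pairs, the
results are:

$$
\begin{aligned}
\mathrm{Var}[G_2]_{\mathrm{par\text{-}child}} &= \frac{p^2(1-p^2)}{2N_r} + \frac{p^3 q}{2N_r}, \\
\mathrm{Var}[G_1]_{\mathrm{par\text{-}child}} &= \frac{2pq(1-2pq)}{2N_r} + \frac{pq(1-4pq)}{2N_r}, \\
\mathrm{Var}[G_0]_{\mathrm{par\text{-}child}} &= \frac{q^2(1-q^2)}{2N_r} + \frac{p q^3}{2N_r},
\end{aligned}
\tag{9}
$$

$$
\begin{aligned}
\mathrm{Var}[G_2]_{\mathrm{unc\text{-}neph}} &= \frac{p^2(1-p^2)}{2N_r} + \frac{p^3 q}{4N_r}, \\
\mathrm{Var}[G_1]_{\mathrm{unc\text{-}neph}} &= \frac{2pq(1-2pq)}{2N_r} + \frac{pq(1-4pq)}{4N_r}, \\
\mathrm{Var}[G_0]_{\mathrm{unc\text{-}neph}} &= \frac{q^2(1-q^2)}{2N_r} + \frac{p q^3}{4N_r}.
\end{aligned}
\tag{10}
$$

The second terms in Eqs. (8), (9), (10) are always non-negative (note: $1 - 3pq = (p-q)^2 + pq > 0$
and $1 - 4pq = (p-q)^2 \ge 0$), indicating that correlation among these relatives always increases
the variance. We can also show that the second terms in these equations are always smaller than the first
terms, indicating that using both relatives always gives a smaller variance than estimating genotype
frequencies from one relative randomly selected from each relative pair.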

4 The Effective Sample Size to Account for the Increase of Variance of 
Genotype Frequencies 

In Eqs. (7) and (8), we have seen that the variances of genotype frequencies are higher than those of
uncorrelated samples (with the same number of individuals). For a binomial distribution, the variance is
inversely proportional to the sample size. Consequently, an increase of the variance due to correlation
can be accounted for by a decrease of the "effective" sample size. We define the effective
sample size as the number of independent samples that leads to the same variance as that
calculated from the correlated samples.

By this definition, and by examining the analytic expressions of the variances in Eqs. (7) and (8), it is
clear that for genotype frequencies, the effective sample size depends on the genotype, as well as
on the allele frequency $p$. Fig. 1 shows the variances of the three genotype frequencies for sibpairs,
parent-child pairs, and uncle-nephew pairs (upper plots, solid lines), and the corresponding variances
for independent samples ($2N_r$ individuals) (dashed lines). The ratio of the two variances is
shown in the lower part of Fig. 1.

Figure 1: (Upper) Variances of genotype frequencies for the sibpairs, parent-child pairs, and uncle-nephew pairs
(solid lines). Three colors indicate different genotypes (red: AA, green: AB, blue: BB). Dashed lines are the
corresponding variances calculated from uncorrelated samples. (Lower) Variance ratios (solid lines over dashed lines) for
three different genotypes. The black line is the weighted average of the variance ratios of the three genotypes.

To get a sense of a typical sample size reduction, we carry out two averaging processes. The
first is to average the three variance ratios with weights $p^2$, $2pq$, $q^2$ for the three genotypes,
at a fixed value of $p$. This average is represented by a black line in Fig. 1 (lower part). The
second is to average over $p$ (the x-axis in Fig. 1), which leads to ≈ 1.414, 1.328, 1.164 for
the sibpairs, parent-child pairs, and uncle-nephew pairs. Since by definition these variance ratios
are equal to $2N_r/(2N_e)$, the actual sample size over the effective sample size, we conclude that a sibpair,
a parent-child pair, and an uncle-nephew pair are equivalent to ≈ 1.4, 1.5, 1.7 effective samples,
respectively, in order to account for the increase of variance
in genotype frequencies.
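
The two averaging steps can be reproduced numerically. The sketch below assumes the variance ratio for each genotype is (independent term + covariance term) / (independent term), uses the diagonal entries of the standard ITO matrices, and integrates over $p$ with a simple midpoint rule. The function names and the grid size are illustrative assumptions.

```python
def conditional_same_genotype(p, pi2, pi1, pi0):
    """P(Y = g | X = g) for g = AA, AB, BB, mixing over the IBD states."""
    q = 1.0 - p
    diag_t = [p, 0.5, q]                  # diagonal of the IBD=1 matrix
    diag_o = [p * p, 2 * p * q, q * q]    # diagonal of the IBD=0 matrix
    return [pi2 + pi1 * t + pi0 * o for t, o in zip(diag_t, diag_o)]


def weighted_variance_ratio(p, pi2, pi1, pi0):
    """Genotype-frequency-weighted average of the three variance ratios at fixed p."""
    q = 1.0 - p
    freqs = [p * p, 2 * p * q, q * q]
    cond = conditional_same_genotype(p, pi2, pi1, pi0)
    total = 0.0
    for g, c in zip(freqs, cond):
        ratio = 1.0 + (c - g) / (1.0 - g)  # Var(correlated) / Var(independent)
        total += g * ratio
    return total


def average_over_p(pi2, pi1, pi0, n_grid=20000):
    """Average the weighted ratio over p in (0, 1) using the midpoint rule."""
    step = 1.0 / n_grid
    return step * sum(
        weighted_variance_ratio((k + 0.5) * step, pi2, pi1, pi0)
        for k in range(n_grid)
    )


if __name__ == "__main__":
    pairs = {
        "sibpair": (0.25, 0.5, 0.25),
        "parent-child": (0.0, 1.0, 0.0),
        "uncle-nephew": (0.0, 0.5, 0.5),
    }
    for name, pis in pairs.items():
        ratio = average_over_p(*pis)
        print(f"{name}: variance ratio {ratio:.3f}, effective samples per pair {2 / ratio:.2f}")
```

Dividing 2 (the number of individuals in a pair) by the averaged ratio gives the effective number of samples per pair quoted in the text.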



5 Variance of Allele Frequency Estimated from Relative Pairs 



Denote by $C_l$ and $D_l$ the number of copies of allele A in the first and the second relative in relative pair
$l$; it is clear that $C = \sum_i i X_i$ and $D = \sum_j j Y_j$ ($i, j = 2, 1, 0$ are the AA, AB, and BB genotypes). The
variance of the allele frequency estimated from all relatives in the relative pairs is

$$
\mathrm{Var}[\hat{p}] = \mathrm{Var}\left[\frac{1}{4N_r}\sum_{l=1}^{N_r}(C_l + D_l)\right] = \frac{\mathrm{Var}[C]}{8N_r} + \frac{\mathrm{Cov}[C,D]}{8N_r}. \tag{11}
$$

Using the ITO matrix and the relationships $C = \sum_i i X_i$ and $D = \sum_j j Y_j$, it can be shown that²:

$$
\mathrm{Var}[\hat{p}] = \frac{pq}{4N_r} + \frac{pq(\pi_1 + 2\pi_2)}{8N_r} = \frac{pq}{4N_r} + \frac{4pq\phi}{8N_r} = \frac{pq(1 + 2\phi)}{4N_r}, \tag{12}
$$

where $\phi$ is the kinship coefficient, defined as the probability that an allele selected randomly
from one person is IBD with a similarly randomly selected allele from another person [15, 11].
In the absence of inbreeding, the kinship coefficient of IBD=2 relative pairs (e.g., identical twins)
is 1/2, and that of IBD=1 relative pairs (e.g., parent-child pairs) is 1/4. Combining the two, we
have $\phi = \pi_2/2 + \pi_1/4$.

The kinship coefficients of the common relative pairs are listed in Table 1, and the increase of
variance of allele frequency due to related individuals can be obtained immediately: for sib and
parent-child pairs, the variance is increased from 1 to 1 + 2/4 = 1.5 (in units of $pq/(4N_r)$), and
for uncle/aunt-nephew/niece and half-sib pairs, the variance is increased from 1 to 1 + 2/8 = 1.25.
By definition, the variance ratio equals the actual sample size divided by the effective sample size,
so the effective sample size for a sibpair (or parent-child pair) is 2/1.5 = 4/3 ≈ 1.33,
and that for an uncle-nephew pair (or half-sib pair) is 2/1.25 = 8/5 = 1.6. Interestingly, these numbers are
similar but not identical to the averaged effective sample sizes based on variances of genotype
frequencies.
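
The arithmetic above reduces to two small formulas, sketched here with exact fractions. The function names are hypothetical; the formulas are the kinship relation $\phi = \pi_2/2 + \pi_1/4$ and the per-pair effective size $2/(1 + 2\phi)$ implied by the variance ratio $1 + 2\phi$.

```python
from fractions import Fraction


def kinship(pi2, pi1):
    """Kinship coefficient phi = pi2/2 + pi1/4 (no inbreeding)."""
    return Fraction(pi2) / 2 + Fraction(pi1) / 4


def effective_pair_size(phi):
    """Number of independent individuals equivalent to one relative pair."""
    return 2 / (1 + 2 * Fraction(phi))


if __name__ == "__main__":
    pairs = {
        "sibpair": kinship(Fraction(1, 4), Fraction(1, 2)),
        "parent-child": kinship(0, 1),
        "uncle-nephew": kinship(0, Fraction(1, 2)),
    }
    for name, phi in pairs.items():
        print(name, "phi =", phi, "effective size =", effective_pair_size(phi))
```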



6 Using Effective Sample Size in Chi-square Tests 

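The effective sample size method can be applied to modify Pearson's chi-square test, originally designed
for independent samples. Denote the allele counts in a 2-by-2 table as $N_{A,\mathrm{case}}$, $N_{B,\mathrm{case}}$, $N_{A,\mathrm{con}}$, $N_{B,\mathrm{con}}$;
Pearson's chi-square test statistic is:

$$
X^2 = \frac{(N_{A,\mathrm{case}}N_{B,\mathrm{con}} - N_{B,\mathrm{case}}N_{A,\mathrm{con}})^2\,(N_{A,\mathrm{case}} + N_{B,\mathrm{case}} + N_{A,\mathrm{con}} + N_{B,\mathrm{con}})}{(N_{A,\mathrm{case}} + N_{B,\mathrm{case}})(N_{A,\mathrm{con}} + N_{B,\mathrm{con}})(N_{A,\mathrm{case}} + N_{A,\mathrm{con}})(N_{B,\mathrm{case}} + N_{B,\mathrm{con}})}. \tag{13}
$$

If control samples are independent, whereas case samples are from $N_r$ relative pairs of the same
type (e.g., all sibpairs) (so $N_{A,\mathrm{case}} + N_{B,\mathrm{case}} = 2 \cdot 2N_r$), we can reduce the apparent allele counts
$N_{A,\mathrm{case}}$, $N_{B,\mathrm{case}}$ by a factor of $\alpha = 1/(1 + 2\phi)$ (e.g., 2/3 for sibpairs) when the affection status is
ignored. The corrected test statistic $X_c^2$ is:

$$
X_c^2 = \frac{\alpha\,(N_{A,\mathrm{case}}N_{B,\mathrm{con}} - N_{B,\mathrm{case}}N_{A,\mathrm{con}})^2\,(\alpha N_{A,\mathrm{case}} + \alpha N_{B,\mathrm{case}} + N_{A,\mathrm{con}} + N_{B,\mathrm{con}})}{(N_{A,\mathrm{case}} + N_{B,\mathrm{case}})(N_{A,\mathrm{con}} + N_{B,\mathrm{con}})(\alpha N_{A,\mathrm{case}} + N_{A,\mathrm{con}})(\alpha N_{B,\mathrm{case}} + N_{B,\mathrm{con}})}. \tag{14}
$$

It can be shown that the ratio $X_c^2/X^2$ is smaller than 1 (but larger than $\alpha$), and the reduced test statistic value leads to a
larger (less significant) p-value.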
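
A minimal sketch of the corrected statistic follows. It simply scales the case allele counts by $\alpha$ before applying the usual 2-by-2 Pearson formula, which is algebraically the same as Eq. (14). The function name and the allele counts in the usage example are hypothetical and chosen only for illustration.

```python
def pearson_chi2(a_case, b_case, a_con, b_con, alpha=1.0):
    """Allele-based Pearson chi-square with case counts scaled by alpha.

    alpha = 1 gives the ordinary statistic; alpha = 1 / (1 + 2 * phi)
    gives the effective-sample-size correction for relative pairs among cases.
    """
    a, b = alpha * a_case, alpha * b_case
    n = a + b + a_con + b_con
    numerator = (a * b_con - b * a_con) ** 2 * n
    denominator = (a + b) * (a_con + b_con) * (a + a_con) * (b + b_con)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical allele counts, not taken from any dataset.
    counts = (300, 700, 200, 800)
    original = pearson_chi2(*counts)
    corrected = pearson_chi2(*counts, alpha=2 / 3)  # all cases are sibpairs
    print(original, corrected, corrected / original)
```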

² The same formula was derived in Ref. [5] by exhaustively counting the mating types.

Table 2: Genotype counts of a SNP in gene PTPN22



In a more general situation, the patient group may consist of singletons, sibpairs,
uncle-nephew pairs, etc. The overall effective sample size can be obtained by applying the reduction
to the specific relative pairs. For example, 60 singletons, 10 sibpairs and 10 uncle-nephew pairs lead
to an effective sample size of (ignoring the affection status) 60 + 20 × 2/3 + 20 × 4/5 = 89.33, or
a reduction of $\alpha$ = 0.8933.
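
The mixed-group calculation can be written as a short helper. This is a sketch assuming each group is described by its number of individuals and the kinship coefficient of its relative pairs (0 for singletons); the function name and the input layout are illustrative.

```python
def effective_case_count(groups):
    """Effective sample size and overall reduction factor for a mixed case group.

    groups is a list of (number of individuals, kinship coefficient) tuples;
    use kinship 0 for singletons.
    """
    effective = sum(n / (1 + 2 * phi) for n, phi in groups)
    total = sum(n for n, _ in groups)
    return effective, effective / total


if __name__ == "__main__":
    # 60 singletons, 10 sibpairs (20 individuals), 10 uncle-nephew pairs (20 individuals)
    groups = [(60, 0.0), (20, 0.25), (20, 0.125)]
    effective, alpha = effective_case_count(groups)
    print(f"effective size = {effective:.2f}, alpha = {alpha:.4f}")
```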



7 Using Effective Sample Size in Calculation of Confidence Interval of
Odds Ratio

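Another common task in case-control analysis is to estimate the 95% confidence interval (CI)
of the odds ratio (OR). The estimation of the OR is straightforward: $\theta = N_{A,\mathrm{case}}N_{B,\mathrm{con}}/(N_{A,\mathrm{con}}N_{B,\mathrm{case}})$.
As shown by Woolf, the 95% CI of the OR can be approximated as:

$$
\left[\theta\, e^{-1.96\,\sigma(\log\theta)},\ \theta\, e^{1.96\,\sigma(\log\theta)}\right], \tag{15}
$$

where the standard error of the logarithm of $\theta$ can be written in terms of the four allele counts:

$$
\sigma(\log\theta) = \left(\frac{1}{N_{A,\mathrm{case}}} + \frac{1}{N_{B,\mathrm{case}}} + \frac{1}{N_{A,\mathrm{con}}} + \frac{1}{N_{B,\mathrm{con}}}\right)^{1/2}. \tag{16}
$$

Similar to our discussion in the last section, when relative pairs are involved in the patient samples,
the apparent sample size is reduced to the effective sample size by a factor of $\alpha$, and the
above formula is modified to

$$
\sigma_c(\log\theta) = \left(\frac{1}{\alpha N_{A,\mathrm{case}}} + \frac{1}{\alpha N_{B,\mathrm{case}}} + \frac{1}{N_{A,\mathrm{con}}} + \frac{1}{N_{B,\mathrm{con}}}\right)^{1/2}. \tag{17}
$$

Since $\alpha$ is smaller than 1, the standard error of $\log\theta$ is increased and, consequently, the 95% CI of the OR is widened.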
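
The interval calculation can be sketched as follows, assuming the standard Woolf approximation with a normal quantile of 1.96 and the $\alpha$-scaled case counts in the standard error. The function name and the allele counts in the usage example are hypothetical.

```python
import math


def odds_ratio_ci(a_case, b_case, a_con, b_con, alpha=1.0, z=1.96):
    """Odds ratio and approximate 95% CI; case counts are discounted by alpha."""
    theta = (a_case * b_con) / (a_con * b_case)
    se = math.sqrt(
        1 / (alpha * a_case) + 1 / (alpha * b_case) + 1 / a_con + 1 / b_con
    )
    return theta, theta * math.exp(-z * se), theta * math.exp(z * se)


if __name__ == "__main__":
    # Hypothetical allele counts, not taken from any dataset.
    print(odds_ratio_ci(300, 700, 200, 800))
    print(odds_ratio_ci(300, 700, 200, 800, alpha=0.7))
```

The point estimate is unchanged by the correction; only the width of the interval grows.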



8 Illustration by a Real Dataset 

The data we use here are taken from Ref. [2] (replication study, all sibs option, in Table 1 of Ref. [2]),
collected for the NARAC project [9, 10, 1]. The control samples are independent, whereas the case
samples consist of both singletons and sibpairs. Discounting the 377 sibpairs (377 × 4 alleles) by a
factor of 2/3, the effective sample size for the case group is reduced from 1680 to 1177.33,
a reduction rate of $\alpha$ = 0.7008.

Using Eq. (13), the original $X^2$ is 53.26, corresponding to a p-value of $2.9 \times 10^{-13}$. With the
effective sample size, the corrected $X_c^2$ is 45.75, corresponding to a p-value of $1.3 \times 10^{-11}$. As for
the 95% CI of the odds ratio, the original interval is [1.73, 2.61]. After reducing the allele counts in the
case group by a factor of $\alpha$, the 95% CI of the odds ratio becomes [1.70, 2.66]. In general, for a
very significant test result, the correction to the p-value due to correlation in relative pairs does not
render the result insignificant, but the exact p-value changes slightly. The same is true
for the 95% CI of odds ratios. Only for a borderline significant result may accounting for correlation among
samples render the result insignificant [17].
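
The numbers quoted in this section can be checked with a few lines. The sketch below uses only figures stated in the text (377 sibpairs, 1680 case alleles, and the two chi-square values) and assumes the test has one degree of freedom, for which the upper-tail p-value equals `erfc(sqrt(x / 2))`. The genotype counts of Table 2 are not reproduced here, so the chi-square values themselves are taken as given.

```python
import math


def chi2_pvalue_1df(x):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def effective_case_alleles(total_alleles, n_sibpairs, alpha_sib=2 / 3):
    """Discount the alleles carried by sibpairs; singleton alleles count fully."""
    sib_alleles = 4 * n_sibpairs
    return (total_alleles - sib_alleles) + alpha_sib * sib_alleles


if __name__ == "__main__":
    effective = effective_case_alleles(1680, 377)
    print(f"effective case alleles = {effective:.2f}, alpha = {effective / 1680:.4f}")
    print(f"p (original)  = {chi2_pvalue_1df(53.26):.2e}")
    print(f"p (corrected) = {chi2_pvalue_1df(45.75):.2e}")
```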

9 Discussion 

The effective sample size framework may not be applicable to situations where case and control
samples are taken from the same pedigree [19]. The problem is that case and control groups are
still treated as distinct entities in Pearson's test. Fortunately, using both affected and unaffected
individuals from the same pedigree in a case-control analysis, though it remains a theoretical possibility,
is not commonly practiced.

Sibships of more than two sibs, or a set of multiple relatives from the same family, can be
treated in a similar way as sibpairs or relative pairs. Take three siblings for example: the variance
of the genotype/allele frequency will contain three covariance terms, one for each sibpair. The
effective sample size and sample size reduction can be determined accordingly.
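
One way to carry out this extension for allele frequencies is sketched below. It is an assumption-laden illustration rather than a formula from the text: using Var[C] = 2pq and Cov[C, D] = 4φpq for each pair of relatives, the variance of the summed allele count of m relatives is 2pq(m + 4Σφ), where the sum runs over all unordered pairs, which gives an effective size of m² / (m + 4Σφ). The function name is hypothetical and affection status is ignored.

```python
from fractions import Fraction


def effective_group_size(m, pairwise_kinship):
    """Independent individuals equivalent to m relatives from one family.

    pairwise_kinship holds one kinship coefficient per unordered pair of
    relatives, so it has m * (m - 1) / 2 entries.
    """
    total_phi = sum(Fraction(phi) for phi in pairwise_kinship)
    return Fraction(m * m) / (m + 4 * total_phi)


if __name__ == "__main__":
    quarter = Fraction(1, 4)
    print("sibpair:", effective_group_size(2, [quarter]))
    print("sib trio:", effective_group_size(3, [quarter] * 3))
```

For a pair this reduces to 2 / (1 + 2φ), and under these assumptions three full sibs count as 3/2 independent samples.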

Adding affection status has a different effect from correlation among samples. For a group
of affected samples, the genotype frequencies of the linked disease gene are altered to $G_{i,\mathrm{aff}} =
(f_i/K)G_i$ ($i = 2, 1, 0$), where $f_2$, $f_1$, $f_0$ are the penetrances of the three genotypes (AA, AB, BB)
and $K$ is the disease prevalence; and the allele frequency for the mutant allele is changed to
$p_{A,\mathrm{aff}} = (p/K)(f_2 p + f_1 q)$. As can
be seen from Fig. 1, a change of allele frequency may indeed change the variance ratio as well
as the corresponding effective sample size. But this effect is small due to the limited range of the
variance ratio in Fig. 1. We plan to carry out future analytic and simulation analyses to confirm this
conjecture.
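
The shift in frequencies among affected individuals can be sketched as below. The code assumes Hardy-Weinberg proportions in the general population and takes the prevalence as K = f2·p² + f1·2pq + f0·q², which is the normalization implied by the formula for the affected genotype frequencies. The function name and the penetrance values in the usage example are hypothetical.

```python
def affected_frequencies(p, f2, f1, f0):
    """Genotype (AA, AB, BB) and allele-A frequencies among affected individuals."""
    q = 1.0 - p
    g = [p * p, 2 * p * q, q * q]               # population genotype frequencies
    f = [f2, f1, f0]                            # penetrances of AA, AB, BB
    k = sum(fi * gi for fi, gi in zip(f, g))    # prevalence K
    g_aff = [fi * gi / k for fi, gi in zip(f, g)]
    p_aff = (p / k) * (f2 * p + f1 * q)
    return g_aff, p_aff, k


if __name__ == "__main__":
    # Hypothetical penetrances, chosen only to show the direction of the shift.
    g_aff, p_aff, k = affected_frequencies(0.2, 0.8, 0.4, 0.1)
    print(g_aff, p_aff, k)
```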

An unexplored idea is to use identical-by-state (IBS) status to calculate the joint probability:
$P(X_i, Y_j) = \sum_k P(Y_j|X_i, k)\,\Pr(\mathrm{IBS} = k)\,P(X_i)$. If this can indeed be done, then IBD status
is not essential to the discussion; it is simply a tool for calculating the covariance.

In conclusion, the effective sample size provides a complete solution to account for the cor- 
relation between relatives if samples consist of relative pairs in a case-control study. We believe 
this approach is simpler and more intuitive than other mathematically sophisticated methods. 